Boston fork: security hardening, upstream-portal protection, AWS hosting docs#1

Merged
brendanbabb merged 12 commits into main from
boston/security-hardening-and-aws-docs on Apr 30, 2026

Conversation


brendanbabb commented Apr 16, 2026

Summary

Two phases of work, all on this branch:

Phase 1 — Infra/security/docs (original PR scope)

  1. Infra — us-west-2 region, reserved Lambda concurrency as a hard cap on upstream fan-out, cp311/manylinux Lambda packaging, S3 backend config extracted out of the public repo.
  2. Security — allowlist-only AST-aware SQL validator + aggregate_data SQL builder; 64 KB request body cap with 413 before JSON parsing.
  3. Docs — docs/AWS_DEPLOYMENT.md, docs/SECURITY.md, stdio_bridge.py, CLAUDE.md.
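
A minimal sketch of the Phase-1 body cap, assuming a Lambda-proxy-shaped handler (names illustrative; the real check lives in server/http_handler.py):

```python
# Reject oversized JSON-RPC bodies with 413 before any parsing (sketch).
MAX_BODY_BYTES = 64 * 1024  # 65,536 bytes

def guard_body_size(raw_body: bytes) -> dict | None:
    """Return a 413 response dict, or None when the body is in bounds."""
    if len(raw_body) > MAX_BODY_BYTES:
        return {
            "statusCode": 413,
            "body": '{"error": "request body exceeds 64 KB"}',
        }
    return None  # caller may now json.loads(raw_body)
```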

Phase 2 — GPT-4o tool-chaining + CKAN ergonomics (added 2026-04-30)

Real-world failures observed via the deployed Lambda informed each fix:

  1. GPT-4o chaining (2f3c09f) — fattened tool descriptions with explicit "Next step" hints; suggested_resource_id surfaced at top of search/dataset output; new composite ckan__search_and_query tool that internally chains search_datasets + query_data.
  2. datastore_active filtering (ddcee2e, 668e24f) — Boston datasets attach 5–7 download-only resources (GeoJSON, KML, SHP, PDF) plus a single CSV that's actually loaded into the datastore. Plugin now picks the queryable one and labels each resource [QUERYABLE] vs [DOWNLOAD-ONLY]. Fixes the parks query that 404'd.
  3. Structured where clauses (5825317) — new where argument on query_data and search_and_query supporting eq, ne, gt, gte, lt, lte, in, not_in, like, ilike, is_null via SafeSQLBuilder. Routes through datastore_search_sql when set. Fixes "311 closed on 2026-04-29" (returns 85, the true count).
  4. Schema footer — every successful row response ends with a Filterable columns block listing field names + types so the next pivot is a one-shot.
  5. Sibling resources + resource_name (c1608e2) — Boston's 311 dataset has 22 queryable CSVs (rolling NEW SYSTEM + per-year archives 2011–2026); the model couldn't see them. Sibling block now lists them; new resource_name argument picks one by case-insensitive substring match (resource_name="2020" → 311 SERVICE REQUESTS - 2020).
  6. Limit-truncation guards (d9b6fca, 47fbdca) — model was reading "Found 100 record(s)" as a count when 100 was the LIMIT. CKAN's total (from datastore_search) and count (from package_search) now surfaced. SQL path does a COUNT(*) follow-up when len(records) >= limit. execute_sql extracts the effective LIMIT and warns when truncated. Phrasing switched to "100 of 531 rows returned" everywhere. Fixes "311 closed on 2016-04-29" (now reports 531 instead of the limit-100 misread).
  7. Lint (a8e9fea) — ruff F541.

Why

Same ground rule as Phase 1: don't overwhelm data.boston.gov. Phase 2 layers on a different goal — make the tools usable enough that GPT-4o (which is weaker at multi-step chaining than Claude) can answer real questions without fumbling. Each fix landed in response to a specific observed failure on the deployed prod Lambda.

Test plan

  • Test suite: 212 passing (was 81 pre-Phase-2)
  • Code quality: ruff clean after a8e9fea
  • Live verification on deployed prod Lambda (boston-opencontext-mcp-prod in us-west-2):
    • search_and_query("parks") → Park_Features CSV, real park rows
    • search_and_query("311", where={close_date: {gte: "2026-04-29", lt: "2026-04-30"}, case_status: "Closed"}) → 85 (matches manual SQL count)
    • search_and_query("311", resource_name="2016", where={closed_dt: {gte: "2016-04-29", lt: "2016-04-30"}}) → 100 of 531 rows returned + TRUNCATED block telling the model "the answer is 531, NOT 100"
    • search_datasets("parks", limit=5) → 5 of 21 matching dataset(s) shown

🤖 Generated with Claude Code

brendanbabb and others added 12 commits April 16, 2026 13:01
… pin

Move the deployment to us-west-2, add reserved Lambda concurrency as the
primary brake on fan-out into the upstream CKAN portal, and pin Lambda
packaging to cp311/manylinux wheels so the ZIP works regardless of build
host Python version.

- terraform/aws: add lambda_reserved_concurrency (default 10) wired to
  aws_lambda_function.reserved_concurrent_executions. Extract the S3
  backend config out of main.tf; real backend.tf is gitignored because
  the bucket name embeds the deployer's AWS account ID. Ship
  backend.tf.example as the template.
- prod/staging tfvars: aws_region=us-west-2, api_quota_limit=3000
  (was 1000), lambda_reserved_concurrency=10. Prod custom domain is
  boston-data.codeforanchorage.org; staging has no custom domain.
- scripts/deploy.sh + .github/workflows/release.yml: force cp311
  manylinux wheel resolution on every pip/uv install (without this,
  a Python 3.14 build host produces a ZIP that 502s at Lambda cold
  start). Detect python3/python cross-platform. Build the ZIP with
  stdlib zipfile instead of the `zip` binary so the packaging step
  works on CI images and Windows.
- scripts/setup-backend.sh: fix malformed bucket name
  (boston-opencontext-opendataterraform-state-... → boston-opencontext-
  tfstate-...).
- config.yaml: replace symlink-to-example with a concrete Boston CKAN
  config targeting data.boston.gov. ArcGIS kept disabled for reference.
- local_server.py: accept POSTs on both / and /mcp so the same local
  server works with Claude Desktop stdio bridges and MCP Inspector.
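
A sketch of the stdlib-zipfile packaging step described above (paths illustrative; the shipped logic lives in scripts/deploy.sh and the release workflow):

```python
# Build the Lambda ZIP with stdlib zipfile instead of the `zip` binary,
# so packaging works on CI images and Windows alike. Assumes deps were
# pre-installed into build/lambda with the cp311/manylinux pins
# (pip install --python-version 3.11 --platform manylinux2014_x86_64
#  --only-binary=:all: --target build/lambda ...).
import os
import zipfile

def build_lambda_zip(src_dir: str, out_path: str) -> None:
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):  # stable order across hosts
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src_dir))

build_lambda_zip("build/lambda", "lambda.zip")  # paths illustrative
```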

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…size

Tighten the two attack surfaces that directly forward user-controlled
input into upstream CKAN: the execute_sql path and the aggregate_data
path. Add a request-body-size cap on the HTTP handler to bound the
work a single JSON-RPC call can cost. See docs/SECURITY.md for the
full threat model.

- plugins/ckan/sql_validator.py
  * SQLValidator: shrink MAX_SQL_LENGTH 50000 → 8192; strip SQL
    comments before keyword/function scans so /* ... */ and -- ...
    obfuscation can't smuggle forbidden tokens past the checks;
    expand FORBIDDEN_KEYWORDS with PREPARE/COPY/LISTEN/NOTIFY/VACUUM/
    ANALYZE/CLUSTER/REINDEX/LOAD/DO; add FORBIDDEN_FUNCTIONS
    (xp_cmdshell, pg_sleep, pg_read_file, pg_ls_dir, pg_stat_file,
    lo_import, lo_export, current_setting, set_config, dblink);
    walk the sqlparse AST to require every FROM/JOIN target to be a
    UUID-quoted resource or a CTE alias (rejects schema-qualified
    targets like pg_catalog.pg_class); match INTO OUTFILE/DUMPFILE.
  * New enforce_row_limit (sketched after this list): appends
    LIMIT 10000 to any validated SQL that lacks a top-level LIMIT so
    a caller can't trigger an unbounded scan on a multi-million-row
    CKAN DataStore table.
  * New SafeSQLBuilder: typed, allowlist-only builder for the
    aggregate_data path. Identifiers must match [A-Za-z_]\w*, metric
    expressions must be count(*) or {count|sum|avg|min|max|stddev}
    ([DISTINCT] <ident>), filter values coerced per type with '
    escaping, order_by parsed and quoted, limit clamped to 10000,
    HAVING values must be numeric.
- plugins/ckan/plugin.py: route aggregate_data through
  SafeSQLBuilder (was string concatenation); call
  SQLValidator.enforce_row_limit after validate_query.
- server/http_handler.py: reject JSON-RPC bodies over 64 KB
  (65,536 bytes) with HTTP 413 before parsing. The MCP surface fits
  in a few KB; a megabyte payload is either a bug or abuse.
- tests: cover body-size cap at and over the boundary, each new
  forbidden keyword/function, comment obfuscation, schema-qualified
  FROM rejection, enforce_row_limit behavior, and every
  SafeSQLBuilder method.
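
A condensed sketch of the comment-stripping and row-limit pieces (sqlparse-based, matching the AST walk described above; exact internals may differ):

```python
# Two validator pieces in miniature: strip comments before scanning,
# and append a LIMIT when none exists at the top level.
import sqlparse
from sqlparse import tokens as T

MAX_ROWS = 10_000

def strip_comments(sql: str) -> str:
    # Normalize away /* ... */ and -- ... before keyword/function
    # scans, so comments can't smuggle forbidden tokens past them.
    return sqlparse.format(sql, strip_comments=True)

def _has_top_level_limit(sql: str) -> bool:
    # Subqueries live inside Parenthesis groups, so scanning only the
    # statement's own tokens ignores their LIMITs.
    stmt = sqlparse.parse(sql)[0]
    return any(tok.ttype is T.Keyword and tok.normalized == "LIMIT"
               for tok in stmt.tokens)

def enforce_row_limit(sql: str) -> str:
    # Append LIMIT 10000 to validated SQL lacking a top-level LIMIT,
    # so no call can trigger an unbounded DataStore scan.
    if _has_top_level_limit(sql):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {MAX_ROWS}"
```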

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Document the Boston fork's AWS hosting and security posture, with
portal protection as the top design constraint. Add stdio_bridge.py
as a Python alternative to the Go stdio client for Claude Desktop.

- docs/AWS_DEPLOYMENT.md: how this fork is hosted (us-west-2, custom
  domain, reserved concurrency, cp311 packaging), what changed vs.
  upstream's single-region default, and how to operate the stack.
  Leads with the portal-protection design constraint.
- docs/SECURITY.md: the full rationale behind the hardening changes,
  organized around who is being protected — upstream portal first,
  deployment second, end users third. Covers the SQL validator and
  SafeSQLBuilder surface, rate limits and body-size cap, privacy
  posture (stateless, no PII, 14-day log retention, SQL truncated
  to 500 chars), and known gaps.
- README.md: link both new docs from the documentation table.
- stdio_bridge.py: minimal Python stdio-to-HTTP bridge (a sketch
  follows this list). Reads JSON-RPC messages from stdin, POSTs them
  to the local/remote MCP server, writes responses to stdout. Useful
  where the Go client is impractical (Windows, no Go toolchain).
- CLAUDE.md: repo guidance for Claude Code sessions — commands,
  request flow, architecture notes, single-plugin rule.
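
A minimal sketch of such a bridge, assuming newline-delimited JSON-RPC framing and a local endpoint (the shipped stdio_bridge.py may differ):

```python
# Minimal stdio-to-HTTP bridge: read newline-delimited JSON-RPC from
# stdin, POST each message to the MCP server, echo responses to stdout.
import sys
import urllib.request

SERVER_URL = "http://localhost:8000/mcp"  # assumption: local dev server

def main() -> None:
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        req = urllib.request.Request(
            SERVER_URL,
            data=line.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = resp.read().decode("utf-8").strip()
        if body:  # JSON-RPC notifications may get empty replies
            sys.stdout.write(body + "\n")
            sys.stdout.flush()

if __name__ == "__main__":
    main()
```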

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds Method 4 to docs/TESTING.md covering how to wire the local HTTP
server to Claude Desktop (claude_desktop_config.json) and Claude Code
(.mcp.json) via stdio_bridge.py. The bridge was previously only mentioned
in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ce_id, search_and_query)

GPT-4o struggles to extract a resource UUID from one tool result and pass
it to the next call. This patch makes the CKAN plugin friendlier to weaker
multi-step tool callers without changing existing tool semantics:

- Tool descriptions now end with explicit "Next step" guidance naming the
  follow-up tool (ckan__query_data, ckan__get_dataset, etc.).
- Per-parameter descriptions spell out provenance ("the `id` field inside
  the `resources` array returned by ckan__search_datasets").
- search_datasets / get_dataset formatted output leads with a NEXT STEP
  block exposing suggested_resource_id and a literal suggested_call line.
- New composite tool ckan__search_and_query that chains search + query
  server-side, eliminating the cross-call ID-passing problem entirely.

Tests: tool count 6 -> 7, three new cases for search_and_query happy path,
no-match, and dataset-with-no-resources. Full CKAN suite passes (32/32).
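
The composite's control flow, as a hypothetical sketch (pick_resource_id and the error shape are illustrative):

```python
# Hypothetical sketch of the server-side chain: search, pick a
# resource id, query; the model never copies a UUID between calls.
async def search_and_query(query: str, limit: int = 100) -> str:
    datasets = await search_datasets(query)            # step 1: search
    resource_id = pick_resource_id(datasets)           # step 2: pick
    if resource_id is None:
        return f"No queryable resource found for {query!r}"
    return await query_data(resource_id, limit=limit)  # step 3: query
```
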
Boston attaches 5-7 resources to each dataset (GeoJSON, KML, SHP, PDF,
ArcGIS REST, CSV), but only the CSV is loaded into the queryable
Postgres datastore. The previous code blindly handed the model
`resources[0].id`,
which is the GeoJSON one for most parks/permits/etc. datasets, and the
caller then got a 404 from datastore_search.

Fixes:
- New _is_queryable / _first_queryable_resource helpers.
- search_datasets formatter: suggested_resource_id and per-dataset
  resource_id lines now point at the datastore_active resource only.
  When no resource is queryable, say so explicitly instead of suggesting
  a doomed UUID.
- get_dataset formatter: NEXT STEP block points at the queryable
  resource; resource list labels each as [QUERYABLE] or [DOWNLOAD-ONLY]
  and surfaces the download URL for non-queryable ones.
- search_and_query: walks search results and resources to find one with
  datastore_active=true (configurable via dataset_index/resource_index).
  Falls through to the next dataset rather than 404-ing on a download-
  only first match.
- query_data: 404s now append a hint about the datastore_active gotcha
  pointing at get_dataset / search_and_query so the model doesn't keep
  retrying the same UUID.

Tool descriptions updated to call out the queryable-vs-download
distinction.
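
The helpers can be as small as this sketch (datastore_active is CKAN's flag for DataStore-loaded resources; exact code may differ):

```python
# A resource is queryable only when it's loaded into the DataStore;
# everything else (GeoJSON, KML, SHP, PDF, ...) is download-only and
# 404s on datastore_search.
def _is_queryable(resource: dict) -> bool:
    return bool(resource.get("datastore_active"))

def _first_queryable_resource(dataset: dict) -> dict | None:
    for resource in dataset.get("resources", []):
        if _is_queryable(resource):
            return resource
    return None  # caller says so explicitly instead of guessing a UUID
```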

Tests: 35 passing (was 32). Three new regressions:
- parks-style multi-resource shape skips GeoJSON and queries CSV
- walks to the next dataset when first has no queryable resource
- query_data 404 includes the datastore_active hint

Live smoke test against data.boston.gov confirms ckan__search_and_query
"parks" now returns real Park_Features rows (Iacono/Readville Playground,
East Boston Memorial Park, etc.) instead of 404.

Real failure observed: model asked for "311 service requests closed on
4/29", called ckan__search_and_query without filters, and got 10
unfiltered rows (almost none closed on 4/29; 85 actually were). Two
design problems:
  1) `query` matches dataset metadata, not row content — but the model
     read it as a row filter.
  2) `filters` is equality-only against datastore_search, so it can't
     express "close_date in [4/29, 4/30)" even if used.

Fixes (A + B + C from the proposal):

A) Tool descriptions explicitly distinguish dataset-metadata search
   from row filtering and steer to the right knob (filters / where /
   execute_sql) with concrete examples.

B) Every successful query_data / search_and_query response now ends
   with a "Filterable columns" footer listing field names + types,
   pulled from the datastore response. The next pivot to where /
   execute_sql becomes a one-shot.

C) New structured `where` argument on query_data and search_and_query.
   Routes through datastore_search_sql via SafeSQLBuilder; supports
   eq, ne, gt, gte, lt, lte, in, not_in, like, ilike, is_null. Strings
   are length-capped (256) and quote-escaped; IN-lists capped (100);
   identifiers validated as before.
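
A condensed sketch of the operator mapping (the shipped version routes through SafeSQLBuilder and coerces values per column type; this sketch quotes everything as text):

```python
# Translate one {column: {op: value}} entry into a SQL condition.
import re

OPS = {"eq": "=", "ne": "<>", "gt": ">", "gte": ">=",
       "lt": "<", "lte": "<="}

def _quote(value) -> str:
    text = str(value)
    if len(text) > 256:
        raise ValueError("string filter value too long")
    return "'" + text.replace("'", "''") + "'"  # escape single quotes

def build_condition(column: str, op: str, value) -> str:
    if not re.fullmatch(r"[A-Za-z_]\w*", column):
        raise ValueError(f"bad identifier: {column!r}")
    if op in OPS:
        return f'"{column}" {OPS[op]} {_quote(value)}'
    if op in ("in", "not_in"):
        if len(value) > 100:
            raise ValueError("IN list too long")
        items = ", ".join(_quote(v) for v in value)
        negate = "NOT " if op == "not_in" else ""
        return f'"{column}" {negate}IN ({items})'
    if op in ("like", "ilike"):
        return f'"{column}" {op.upper()} {_quote(value)}'
    if op == "is_null":  # assumed semantics: truthy -> IS NULL
        return f'"{column}" IS {"NULL" if value else "NOT NULL"}'
    raise ValueError(f"unknown operator: {op!r}")

# {"close_date": {"gte": "2026-04-29", "lt": "2026-04-30"}} renders as
# "close_date" >= '2026-04-29' AND "close_date" < '2026-04-30'
```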

Live verification: search_and_query("311", where={close_date: {gte:
"2026-04-29", lt: "2026-04-30"}, case_status: "Closed"}) now returns
exactly 85 rows — matching ground truth from a manual SQL count.

Tests: +26 new (170 -> 196). Coverage:
- where-clause builder for every operator + edge case (injection
  escaping, bad ops/types, oversized lists/strings, unknown operators)
- query_data routes through datastore_search_sql when `where` set
- query_data validation errors short-circuit before any API call
- schema footer surfaces in both SQL and non-SQL paths

Boston's 311 dataset attaches 22 queryable CSV resources (a rolling
'NEW SYSTEM' view plus per-year archives 2011–2026). The model only ever
saw whichever the auto-pick chose first — typically the rolling view —
which silently dropped older data on the floor for historical questions.

Two improvements:

(1) `_format_search_and_query` now emits an "Other queryable resources
in this dataset" block listing every datastore_active resource other
than the chosen one (name, format, resource_id), so the model can see
that '311 - 2020', '311 - 2019', etc. exist.

(2) New `resource_name` argument on search_and_query: case-insensitive
substring match against resource `name`. Takes precedence over
`resource_index`. With dataset_index pinned, a no-match returns a clean
error listing the available names.

Selection precedence (high → low):
  resource_name > resource_index > first datastore_active resource
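
In code, that precedence might read (sketch; error shape illustrative):

```python
# Selection sketch: resource_name beats resource_index beats the
# first datastore_active resource.
def pick_resource(resources: list[dict],
                  resource_name: str | None = None,
                  resource_index: int | None = None) -> dict | None:
    queryable = [r for r in resources if r.get("datastore_active")]
    if resource_name is not None:
        needle = resource_name.lower()
        matches = [r for r in queryable
                   if needle in (r.get("name") or "").lower()]
        if not matches:
            names = ", ".join(r.get("name", "?") for r in queryable)
            raise LookupError(
                f"no resource matches {resource_name!r}; available: {names}")
        return matches[0]
    if resource_index is not None:
        return queryable[resource_index]
    return queryable[0] if queryable else None  # first datastore_active
```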

Tool description updated with concrete example
(resource_name='2020' picks the 2020 archive). The 'Other queryable
resources' block tells the model these names exist; the description
tells it how to use them.

Live verification: search_and_query("311", resource_name="2020") now
queries 311 SERVICE REQUESTS - 2020 (resource 6ff6a6fd-...) and returns
2020 records. Default no-resource_name query still picks the rolling
NEW SYSTEM and surfaces all 21 archive siblings in the response.

Tests: +3 (38 -> 41) covering resource_name match, no-match error, and
the siblings block (queryable-only, excludes download-only resources).

Real-world failure: bot answered "How many 311 requests closed on
4/29/2016?" with 100 — but 100 was the LIMIT; the true count is 531.
The model was reading "Found 100 record(s) (showing up to 100)" as
the answer to a counting question.

Fixes:

1. Capture CKAN's `total` from datastore_search responses (it's
   already returned by CKAN; we were discarding it). Plumbed through
   _query_with_schema → ToolResult → formatter.

2. SQL (`where`) path: when SELECT * hits the limit exactly we don't
   know the true total, so issue a cheap SELECT COUNT(*) follow-up
   with the same WHERE. This means we always tell the model a real
   number when it asks a counting question via search_and_query or
   query_data, regardless of which path was taken.

3. Formatter rewrite:
   - Header line is now `total_matching_rows: N (returned K, limit=L)`
     when total is known and exceeds returned. Removed the misleading
     "Found 100 record(s)" wording that conflated rows-returned with
     true count.
   - When truncation is detected, prepend a `=== TRUNCATED ===` block
     stating "the answer is N, NOT K" and pointing at
     ckan__aggregate_data for cheap counts and ckan__execute_sql for
     custom LIMIT/ORDER BY.
   - When total can't be determined (count follow-up failed) and rows
     == limit, prepend `=== MAY BE TRUNCATED ===`.
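
A sketch of the COUNT(*) follow-up from item 2 (datastore_search_sql stands in for whatever async CKAN wrapper the plugin uses):

```python
# When SELECT * fills the limit exactly, the true total is unknown,
# so fire a cheap COUNT(*) with the same WHERE clause.
async def true_total(resource_id: str, where_sql: str,
                     records: list, limit: int) -> int | None:
    if len(records) < limit:
        return len(records)  # full result; no follow-up needed
    count_sql = (f'SELECT COUNT(*) AS n FROM "{resource_id}" '
                 f"WHERE {where_sql}")
    try:
        rows = await datastore_search_sql(count_sql)
        return int(rows[0]["n"])
    except Exception:
        return None  # formatter falls back to MAY BE TRUNCATED
```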

Live verification (boston prod CKAN):
- search_and_query "311 closed on 2016-04-29" with default limit=100
  → returns the 100 sample rows AND header "total_matching_rows: 531"
  AND the TRUNCATED warning telling the model the answer is 531.
- search_and_query "311 closed on 2026-04-29" with limit=100
  → returns 85 rows, no TRUNCATED warning (85 < 100), header reads
  "total_matching_rows: 85".

Tests: +5 (41 -> 46) covering total surfaced from datastore_search,
no-warning path when under limit, COUNT(*) follow-up fires only when
truncated on the SQL path, and graceful fallback when count fails.

User flagged: limit-as-count confusion isn't unique to query_data.
Same trap exists in execute_sql ("returned 100 because the SQL said
LIMIT 100") and in search_datasets ("found 20 datasets" while CKAN
knows there are actually 47 matches, since package_search returns
`count`). Switched to the user-suggested 'X of Y' phrasing throughout.

Changes:

1. Shared _format_count_header helper renders three cases:
   - "100 of 531 rows returned (limit=100; raise limit or use
     ckan__aggregate_data for full count)."
   - "85 rows returned (full result, limit=100)."
   - "100 rows returned (limit=100, true total unknown — see
     TRUNCATED warning above if any)."
   Used in query_data, search_and_query, and search_datasets
   (with unit="matching dataset(s) shown").

2. search_datasets now reads CKAN's `count` from package_search and
   surfaces it: "5 of 21 matching dataset(s) shown" instead of
   "Found 5 dataset(s)".

3. execute_sql now extracts the effective LIMIT from the validated
   SQL via SQLValidator.extract_top_level_limit() and emits a
   "MAY BE TRUNCATED" block when len(records) >= effective_limit
   (datastore_search_sql doesn't return a `total`, so this heuristic
   is the best we can do). aggregate_data inherits the same guard
   since it formats results via _format_sql_results.

4. Existing query_data/search_and_query header lines switched to the
   X-of-Y phrasing for consistency. The TRUNCATED warning text now
   matches the rest of the corpus.
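
The shared helper might look like this sketch of the three renderings above (signature illustrative):

```python
def _format_count_header(returned: int, total: int | None, limit: int,
                         unit: str = "rows returned") -> str:
    # Case 1: true total known and larger than what was returned.
    if total is not None and total > returned:
        return (f"{returned} of {total} {unit} (limit={limit}; raise "
                f"limit or use ckan__aggregate_data for full count).")
    # Case 2: full result fits under the limit.
    if total is not None:
        return f"{returned} {unit} (full result, limit={limit})."
    # Case 3: total unknowable (e.g. datastore_search_sql path).
    return (f"{returned} {unit} (limit={limit}, true total unknown — "
            f"see TRUNCATED warning above if any).")
```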

Live verification (boston prod CKAN):
- search_datasets("parks", limit=5) → "5 of 21 matching dataset(s)
  shown (limit=5; raise limit to see more)".
- execute_sql with LIMIT 100 returning 100 rows → MAY BE TRUNCATED
  block at top, "100 rows returned (limit=100, true total
  unknown...)".

Tests: +8 (204 -> 212). New: search_datasets count rendering, two
execute_sql truncation cases, four extract_top_level_limit unit tests
(simple, semicolon, missing, subquery-ignored), enforce-then-extract
round trip.
brendanbabb merged commit e9479ca into main on Apr 30, 2026
2 checks passed